R Programming

Introduction

The iris dataset includes measurements of 150 iris flowers from three species. This report combines visualizations, and statistical methods for an end-to-end analysis.

1. Inspecting the Dataset

First, let’s get a feel for the data

1.1 Loading Dataset

In R, there are many ways to load data, depending on the format of your data. Here we will load data from the dataset that come with R using the library() method

library(datasets) 

1.2 View the first few rows of the dataset

head(iris) will show you the first 6 rows, showing the column names and a sample of the data.

head(iris)

1.3 Get the structure of the dataset (data types, number of observations/variables)

str() show the type of data frame, number of observations (rows) and the number of the variables (columns). It will also show the data type of each column (e.g., num for numeric, Factor for species).

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

1.4 Get the column names

The name() displays the colunm headings for the dataset

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

1.5 Get summary statistics for each column

The summary() provides the minimum (min), 1st quartile, median, mean, 3rd quartile, and maximum (max) for the numerical columns, and the counts for each column (level of the Species factor).

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

1.6 Descriptive Statistics

We examine summary statistics to understand the central tendency and dispersion of each numeric variable. Summary of numeric features to understand central tendency and spread. Loads the psych package, which provides more detailed descriptive statistics.

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
library(psych)
describe(iris[, 1:4])

1.7 Group-wise Summary by Species

Compare mean feature values across species.

aggregate(. ~ Species, data = iris[, 1:5], mean)

2. Visualisation

To visualise the data we use the plot() method.

2.1 Creating a Basic Scatter Plot

A scatter plot showing the relationship between Sepal Length and Petal Length, colored by species.

plot(iris$Petal.Length)

2.1.1 Customising the Plot

The plot() function has many arguments you can use to customize your graph. Here are some of the most common ones:

  • main: Sets the main title of the plot.
  • xlab: Sets the label for the x-axis.
  • ylab: Sets the label for the y-axis.
  • col: Changes the color of the points.
  • pch: Changes the plotting symbol (e.g., circles, triangles, squares).
  • cex: Changes the size of the plotting symbols.
  • type: Specifies the type of plot.
  • “p” for points (default)
  • “l” for lines
  • “b” for both points and lines
  • “o” for both points and lines, overlaid
  • “h” for histogram-like vertical lines
  • “s” for stair steps
  • “n” for no plotting (useful for setting up axes first)
plot(iris$Sepal.Length, iris$Petal.Length,
     main = "Sepal vs Petal Length",
     xlab = "Sepal Length",
     ylab = "Petal Length",
     col = "blue",
     pch = 16
)

2.1.2 Further Customisation Plots

To reveal how Petal Length varies with Sepal Length across species. Helpful in spotting clusters or patterns. Sepal Length (X) vs Petal Length (Y). Points are colored by Species (setosa, versicolor, virginica). A legend shows which color corresponds to which species.

plot(iris$Sepal.Length, iris$Petal.Length,
     col = iris$Species,
     pch = 16,
     main = "Sepal vs Petal Length by Species",
     xlab = "Sepal Length",
     ylab = "Petal Length")
    legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 16)

2.2 Box Plot

Quickly compare the central tendency and spread of Petal Length by species. A boxplot displays the distribution of Petal Length across the three species, highlighting medians and variability. Our box plot shows the Petal Length by Species

boxplot(Petal.Length ~ Species, data = iris, col = c("lightblue", "lightgreen", "lightpink"))

2.3 Histogram

A histogram helps visualize the frequency distribution of Sepal Width for all flowers. Checks if Sepal Width is normally distributed or skewed, and to see the spread of values. Below is the Histogram of Sepal Width

hist(iris$Sepal.Width, col = "skyblue", border = "white")

2.4 Pairwise Scatterplots (Scatterplot Matrix)

This matrix of scatterplots shows pairwise relationships among all four numeric variables, with species color-coded

pairs(iris[1:4], main = "Pairwise Plot", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

3.0 Cleaner Visualization

A ggplot2() version of the scatter plot offering cleaner visualization and built-in legends.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point(size = 2) +
  labs(title = "Petal vs Sepal Length", x = "Sepal Length", y = "Petal Length") +
  theme_minimal()

3.1 Density Plot

A density plot compares the distribution of Sepal Width across species using smooth curves.

ggplot(iris, aes(x = Sepal.Width, fill = Species)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Sepal Width") +
  theme_minimal()

3.2 Interactive Visualization

You can use the plotly() package:

### install.packages("plotly")
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(iris, x = ~Sepal.Length, y = ~Petal.Length,
        type = 'scatter', mode = 'markers',
        color = ~Species)

🔎 Statistical Analysis

4.0 Correlation Matrix

We compute correlations between numeric variables to assess linear relationships.

cor(iris[, 1:4])
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

4.1 Correlation Heatmap

cor_matrix <- cor(iris[, 1:4])
heatmap(cor_matrix, col = colorRampPalette(c("white", "blue"))(20),
        main = "Correlation Matrix of Iris Features")

4.3 ANOVA

ANOVA tests if the mean Petal Length is significantly different across the three species. We will use the aov() function

anova_model <- aov(Petal.Length ~ Species, data = iris)
summary(anova_model)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  437.1  218.55    1180 <2e-16 ***
## Residuals   147   27.2    0.19                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.4 Post-hoc Test (Tukey HSD)

After an ANOVA test tells us that at least one group mean is different, it does not tell us which groups differ. This is where Tukey’s Honest Significant Difference (HSD) test TukeyHSD() comes in.

anova_model <- aov(Petal.Length ~ Species, data = iris)
TukeyHSD(anova_model)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Petal.Length ~ Species, data = iris)
## 
## $Species
##                       diff     lwr     upr p adj
## versicolor-setosa    2.798 2.59422 3.00178     0
## virginica-setosa     4.090 3.88622 4.29378     0
## virginica-versicolor 1.292 1.08822 1.49578     0

4.5 Linear Regression

We fit a linear model using the lm() to predict Petal Length from Sepal Length and evaluate its explanatory power

lm_model <- lm(Petal.Length ~ Sepal.Length, data = iris)
summary(lm_model)
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47747 -0.59072 -0.00668  0.60484  2.49512 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
## Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared:   0.76,  Adjusted R-squared:  0.7583 
## F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

4.6 Regression Diagnostic Plots

plot(lm_model)

4.7 Principal Component Analysis (PCA)

PCA function prcomp() reduces dimensionality while preserving as much variance as possible, helping us visualize data structure.

4.7.1 PCA summary

pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

4.7.2 Scree Plot

The scree plot shows the variance explained by each principal component.

plot(pca, type = "l", main = "Scree Plot of PCA")

4.7.3 PCA Biplot

A biplot shows both the scores of each sample and the contribution of each variable.

biplot(pca, col = c("gray", "red"))

4.4 PCA Visualization with ggplot2

A clearer visualization of PCA results using the first two principal components, colored by species.

pca_df <- as.data.frame(pca$x)
pca_df$Species <- iris$Species

library(ggplot2)
ggplot(pca_df, aes(x = PC1, y = PC2, color = Species)) +
  geom_point(size = 2) +
  labs(title = "PCA of Iris Dataset", x = "PC1", y = "PC2") +
  theme_minimal()

5.0 Loading from External File

5.1 Installs pacman (“package manager”) if needed

if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman

5.2 Use pacman to load add-on packages as desired

pacman::p_load(pacman, rio) 

5.3 Loading RIO

getwd()
## [1] "C:/Users/FINANCE/Documents/Divine/BYU/CSE 310 Applied Programming/Data Analysis Project"
install.packages("rio")
## Warning: package 'rio' is in use and will not be installed
library(rio)

5.4 Importing CSV

rio_csv <- import("mbb.csv")
head(rio_csv)

5.5 Importing TXT

rio_txt <- import("mbb.txt")
head(rio_txt)

5.6 Excel XLSX

rio_xlsx <- import("mbb.xlsx")
head(rio_xlsx)

DATA VIEWER

?View
## starting httpd help server ... done
View(rio_csv)

READ.TABLE FOR TXT FILES

R’s built-in function for text files (used by rio)

TEXT FILES

Load a spreadsheet that has been saved as tab-delimited

text file. Need to give complete address to file. This

command gives an error on missing data but works on

complete data.

r_txt1 <- read.table("mbb.txt", header = TRUE)
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
## : number of items read is not a multiple of the number of columns

This works with missing data by specifying the separator:

s for tabs, sep = “,” for commas. R converts missing

to “NA”

r_txt2 <- read.table("mbb.txt", 
  header = TRUE, 
  sep = "\t")

READ.CSV FOR CSV FILES

R’s built-in function for csv files (also used by rio)

CSV FILES

Don’t have to specify delimiters for missing data

because CSV means “comma separated values”

trends.csv <- read.csv("mbb.csv", header = TRUE)

CLEAN UP

Clear environment

rm(list = ls()) 

Clear packages

p_unload(all)  # Remove all add-ons
## The following packages have been unloaded:
## rio, pacman, plotly, ggplot2, psych

Clear console

cat("\014")  # ctrl+L

Clear mind :)

Key takeaways

We combined visual and statistical tools to analyze the classic iris dataset.

  • Petal Length differs significantly by species (ANOVA).

  • Sepal Length is a strong predictor of Petal Length (Regression).

  • PCA helps reduce dimensions while revealing structure across species.

  • This workflow is a powerful template for analyzing other classification datasets.